Manipulating, processing, cleaning, and crunching data in Python.
Essential Python Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
General tasks
In [2]:
data_path = 'data/usagov_bitly_data2012-03-16-1331923249.txt'
open(data_path).readline()
Out[2]:
In [3]:
import json
with open(data_path, 'r') as data_file:
records = [json.loads(line) for line in data_file]
In [4]:
len(records)
Out[4]:
In [5]:
type(records)
Out[5]:
In [6]:
records[0]
Out[6]:
In [7]:
type(records[0])
Out[7]:
In [8]:
records[0].keys()
Out[8]:
In [9]:
time_zones = [record['tz'] for record in records if 'tz' in record]
In [10]:
len(time_zones)
Out[10]:
In [11]:
time_zones[:10]
Out[11]:
In [12]:
from collections import Counter
time_zone_counter = Counter(time_zones)
time_zone_counter
Out[12]:
In [13]:
sorted_time_zone_counter = time_zone_counter.most_common()
In [14]:
sorted_time_zone_counter
Out[14]:
In [15]:
df = pd.DataFrame(records)
In [16]:
df.head()
Out[16]:
In [17]:
df.tz.value_counts().head()
Out[17]:
In [18]:
clean_tz = df.tz.fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
In [19]:
tz_counts = clean_tz.value_counts()
tz_counts.head()
Out[19]:
In [20]:
tz_counts.head(10).plot(kind='barh')
Out[20]:
In [21]:
df.shape
Out[21]:
In [22]:
clean_df = df[df.a.notnull()]
clean_df.shape
Out[22]:
In [23]:
operating_system = np.where(clean_df.a.str.contains('Windows'), 'Windows', 'Not Windows')
operating_system
Out[23]:
In [24]:
operating_system.shape
Out[24]:
In [25]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('data/ml-1m/users.dat', delimiter='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('data/ml-1m/ratings.dat', delimiter='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/ml-1m/movies.dat', delimiter='::', header=None, names=mnames, engine='python')
In [26]:
users.head()
Out[26]:
In [27]:
ratings.head()
Out[27]:
In [28]:
movies.head()
Out[28]:
In [29]:
data = pd.merge(pd.merge(ratings, users), movies)
data.head()
Out[29]:
In [30]:
data.dtypes
Out[30]:
In [31]:
data.loc[0, :]
Out[31]:
In [32]:
data.columns
Out[32]:
In [33]:
data.loc[0, 'user_id']
Out[33]:
In [34]:
mean_ratings = data.pivot_table('rating', index='title', columns='gender')
mean_ratings.head()
Out[34]:
In [35]:
ratings_by_title = data.groupby('title').size().sort_values(ascending=False)
ratings_by_title.head()
Out[35]:
In [36]:
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles.size
Out[36]:
In [37]:
mean_ratings = mean_ratings.loc[active_titles, :]
mean_ratings.head()
Out[37]:
In [38]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings.head()
Out[38]:
In [39]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
mean_ratings.head()
Out[39]:
In [40]:
mean_ratings.sort_values(by='diff', ascending=False).head()
Out[40]:
In [41]:
rating_std_by_title = data.groupby('title').rating.std()
rating_std_by_title.head()
Out[41]:
In [42]:
rating_std_by_title.loc[active_titles].sort_values(ascending=False).head()
Out[42]:
In [43]:
data_path = 'data/names/yob2015.txt'
names2015 = pd.read_csv(data_path, names=['name', 'sex', 'births'])
names2015.head()
Out[43]:
In [44]:
names2015.groupby('sex').births.sum()
Out[44]:
In [45]:
pieces = []
years = range(1880, 2016)
col_names = ['name', 'sex', 'births']
for year in years:
    path = 'data/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=col_names)
    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
In [46]:
names.shape
Out[46]:
In [47]:
names.head()
Out[47]:
In [48]:
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)
total_births.plot()
Out[48]:
Key features
$ ipython qtconsole --pylab=inline
$ ipython --pylab
The previous two outputs are stored in the _ (one underscore) and __ (two underscores) variables, respectively. Input variables are stored in variables named _iX, where X is the input line number.
IPython is capable of logging the entire console session including input and output. Logging is turned on by typing %logstart.
Companion functions %logoff, %logon, %logstate, and %logstop.
You can perform most standard command-line actions as you would in the Windows or UNIX (Linux, OS X) shell without having to exit IPython. This includes executing shell commands, changing directories, and storing the results of a command in a Python object (list or string).
Starting a line in IPython with an exclamation point !, or bang, tells IPython to execute everything after the bang in the system shell.
In [49]:
!ls
The %alias magic function can define custom shortcuts for shell commands.
In [50]:
%alias ll ls -l
In [51]:
ll /usr
IPython has a simple directory bookmarking system to enable you to save aliases for common directories so that you can jump around very easily.
In [52]:
%bookmark notebooks /usr/local/src/notebooks
In [53]:
cd notebooks
IPython has closely integrated and enhanced the built-in Python pdb debugger. IPython has easy-to-use code timing and profiling tools.
The %debug command, when entered immediately after an exception, invokes the “post-mortem” debugger and drops you into the stack frame where the exception was raised.
Executing the %pdb command makes it so that IPython automatically invokes the debugger after any exception, a mode that many users will find especially useful.
%time runs a statement once, reporting the total execution time.
In [54]:
strings = ['foo', 'foobar', 'baz', 'qux', 'python', 'Guido Van Rossum'] * 100000
In [55]:
%time method1 = [x for x in strings if x.startswith('foo')]
In [56]:
%time method2 = [x for x in strings if x[:3] == 'foo']
To get a more precise measurement, use the %timeit magic function. Given an arbitrary statement, it has a heuristic to run a statement multiple times to produce a fairly accurate average runtime.
In [57]:
%timeit [x for x in strings if x.startswith('foo')]
In [58]:
%timeit [x for x in strings if x[:3] == 'foo']
Profiling code is closely related to timing code, except it is concerned with determining where time is spent. The main Python profiling tool is the cProfile module, which is not specific to IPython at all. cProfile executes a program or any arbitrary block of code while keeping track of how much time is spent in each function.
In [59]:
# python -m cProfile -s cumulative cprof_example.py
%prun takes the same “command line options” as cProfile but will profile an arbitrary Python statement instead of a whole .py file
%lprun computes a line-by-line profiling of one or more functions.
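A minimal sketch of %lprun, assuming the line_profiler extension is installed and loaded (the function profiled here is made up for illustration):
%load_ext line_profiler

def add_and_sum(x, y):
    added = x + y          # elementwise addition
    return added.sum()     # reduce to a scalar

# profile add_and_sum line by line while running the given statement
%lprun -f add_and_sum add_and_sum(np.random.randn(100), np.random.randn(100))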
The IPython Notebook has a JSON-based .ipynb document format that enables easy sharing of code, output, and figures. The notebook application runs as a lightweight server process on the command line.
IPython's configuration options are specified in a special ipython_config.py file, which is found in the ~/.config/ipython/ directory on UNIX-like systems and in the %HOME%/.ipython/ directory on Windows.
NumPy, short for Numerical Python, is the fundamental package required for high performance scientific computing and data analysis.
Features:
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python. An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array
In [60]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
Out[60]:
In [61]:
arr1.shape
Out[61]:
In [62]:
arr1.dtype
Out[62]:
In [63]:
np.arange(15)
Out[63]:
In [64]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
numeric_strings.astype(float)
Out[64]:
Arrays are important because they enable you to express batch operations on data without writing any for loops. This is usually called vectorization.
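For example, arithmetic between equal-size arrays, or between an array and a scalar, is applied elementwise with no explicit loop (a small illustrative sketch):
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr * arr        # elementwise multiplication
arr - arr        # elementwise subtraction
1 / arr          # a scalar operation propagates to every element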
In [65]:
arr = np.zeros((2, 2, 2))
arr[:1, :1, :1]
Out[65]:
In [66]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
In [67]:
data[names == "Bob", :]
Out[67]:
Note: The Python keywords and and or do not work with boolean arrays.
To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order
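A quick sketch of both points, combining conditions with & and | (not the and/or keywords) and reordering rows with an integer list (fancy indexing):
arr = np.arange(32).reshape((8, 4))
mask = (np.arange(8) % 2 == 0) | (np.arange(8) > 5)   # use & and |, not 'and'/'or'
arr[mask]
arr[[4, 3, 0, 6]]        # rows in the requested order
arr[[-3, -5, -7]]        # negative indices select rows from the end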
In [68]:
arr = np.arange(15).reshape((3, 5))
arr
Out[68]:
In [69]:
arr.T
Out[69]:
In [70]:
arr.transpose((1, 0))
Out[70]:
A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
Many ufuncs are simple elementwise transformations, like sqrt or exp
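For example, unary and binary ufuncs:
arr = np.arange(10)
np.sqrt(arr)                 # unary ufunc: elementwise square root
np.exp(arr)                  # unary ufunc: elementwise exponential
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x, y)             # binary ufunc: elementwise maximum of two arrays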
Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization.
The numpy.where function is a vectorized version of the ternary expression x if condition else y
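For example:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
np.where(cond, xarr, yarr)          # take from xarr where cond is True, else from yarr
arr = np.random.randn(4, 4)
np.where(arr > 0, 2, -2)            # the second and third arguments can also be scalars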
A set of mathematical functions that compute statistics about an entire array, or about the data along an axis, is accessible as array methods. Aggregations (often called reductions) like sum, mean, and standard deviation std can be used either by calling the array instance method or by using the top-level NumPy function.
In [71]:
arr = np.random.randn(100)
(arr > 0).sum()
Out[71]:
In [72]:
bools = np.array([False, False, True, False])
In [73]:
bools.any()
Out[73]:
In [74]:
bools.all()
Out[74]:
Like Python’s built-in list type, NumPy arrays can be sorted in-place using the sort method. Multidimensional arrays can have each 1D section of values sorted in-place along an axis by passing the axis number to sort. The top level method np.sort returns a sorted copy of an array instead of modifying the array in place.
np.unique returns the sorted unique values in an array.
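A few quick sketches of sorting and np.unique:
arr = np.random.randn(8)
arr.sort()                   # in-place sort
np.sort(arr)                 # returns a sorted copy instead
arr2d = np.random.randn(5, 3)
arr2d.sort(axis=1)           # sort each row in place along axis 1
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will'])
np.unique(names)             # sorted unique values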
np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy.
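For example (the file names here are purely illustrative):
arr = np.arange(10)
np.save('some_array', arr)                      # writes some_array.npy
np.load('some_array.npy')
np.savez('array_archive.npz', a=arr, b=arr)     # several arrays in one zipped archive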
np.loadtxt and np.genfromtxt are two functions to load data into vanilla NumPy arrays.
numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant. These are implemented under the hood using the same industry-standard Fortran libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or possibly (depending on your NumPy build) the Intel MKL.
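For example:
from numpy.linalg import inv, det
X = np.random.randn(3, 3)
mat = X.T.dot(X)             # a symmetric matrix built from random data
inv(mat)                     # matrix inverse
det(mat)                     # determinant
mat.dot(inv(mat))            # approximately the identity matrix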
The numpy.random module supplements the built-in Python random with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.
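For example:
samples = np.random.normal(loc=0, scale=1, size=(4, 4))   # 4 x 4 array of standard normal draws
np.random.randint(0, 10, size=8)                           # uniform random integers in [0, 10)
np.random.binomial(10, 0.5, size=5)                        # binomial(n=10, p=0.5) draws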
In [1]:
import pandas as pd
from pandas import Series, DataFrame
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
In [2]:
obj = Series([4, 7, -5, 3])
obj
Out[2]:
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (one for all sharing the same index)
In [3]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
Out[3]:
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index. Index objects are immutable and thus can’t be modified by the user
In [4]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Out[4]:
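A small illustration of that immutability (sketch):
index = pd.Index(['a', 'b', 'c'])
# index[1] = 'd'   # would raise TypeError: Index does not support mutable operations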
The reindex method creates a new object with the data conformed to a new index.
In [79]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[79]:
In [80]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
Out[80]:
Dropping one or more entries from an axis is easy if you have an index array or list without those entries.
In [81]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
Out[81]:
In [82]:
obj.drop('c', axis=0)
Out[82]:
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.
In [83]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
Out[83]:
In [84]:
obj['b']
Out[84]:
In [85]:
obj.loc['b']
Out[85]:
In [86]:
obj.ix['b']  # .ix is deprecated in newer pandas; prefer .loc (labels) or .iloc (positions)
Out[86]:
In [87]:
obj.iloc[1]
Out[87]:
One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.
In [5]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [6]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [7]:
s1 + s2
Out[7]:
In [8]:
s1.add(s2, fill_value=0)
Out[8]:
In [10]:
import numpy as np
frame = DataFrame(np.random.randn(4, 3), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])
In [11]:
np.abs(frame)
Out[11]:
In [13]:
f = lambda x: x.max() - x.min()
In [14]:
frame.apply(f)
Out[14]:
In [16]:
frame.apply(f, axis=1)
Out[16]:
In [17]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
In [18]:
obj
Out[18]:
In [19]:
obj.sort_index()
Out[19]:
In [20]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
In [23]:
frame.sort_values('b')
Out[23]:
Up until now all of the examples I’ve shown you have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory.
In [24]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
In [25]:
obj['a']
Out[25]:
In [35]:
df = DataFrame(np.random.randn(4, 3), index=list("abcd"), columns=list("efg"))
In [36]:
df.sum()
Out[36]:
In [37]:
df.sum(axis=1)
Out[37]:
In [38]:
df.mean()
Out[38]:
In [39]:
df.describe()
Out[39]:
In [40]:
df = DataFrame(np.random.randn(4, 3), index=list("0123"), columns=list("abc"))
In [41]:
df
Out[41]:
In [42]:
df.corr()
Out[42]:
In [44]:
df.cov()
Out[44]:
In [45]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [46]:
obj.unique()
Out[46]:
In [47]:
obj.value_counts()
Out[47]:
In [48]:
obj.isin(['b', 'c'])
Out[48]:
pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating point and non-floating point arrays. It is just used as a sentinel that can be easily detected.
In [49]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [50]:
string_data.isnull()
Out[50]:
In [53]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
Out[53]:
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value.
In [57]:
data.fillna(0)
Out[57]:
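fillna accepts more than a single constant; a couple of common variations on the same Series (sketch):
data.fillna(data.mean())          # fill holes with the mean of the observed values
data.fillna(method='ffill')       # propagate the last valid observation forward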
Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.
In [58]:
data = Series(np.random.randn(10), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
In [59]:
data
Out[59]:
In [60]:
data.index
Out[60]:
In [61]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
In [66]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
In [67]:
frame
Out[67]:
In [68]:
frame.swaplevel("key1", "key2")
Out[68]:
In [69]:
frame.swaplevel(0, 1).sort_index(level=0)  # sortlevel(0) in older pandas versions
Out[69]:
In [70]:
frame.sum(level='key2')  # equivalent to frame.groupby(level='key2').sum()
Out[70]:
In [71]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
In [73]:
frame
Out[73]:
In [72]:
frame.set_index(['c', 'd'])
Out[72]:
Input and output typically falls into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.
pandas features a number of functions for reading tabular data as a DataFrame object. read_csv and read_table are likely the ones you’ll use the most. The options for these functions fall into a few categories: indexing, type inference and data conversion, datetime parsing, iterating, and handling of unclean data.
Type inference is one of the more important features of these functions; that means you don’t have to specify which columns are numeric, integer, boolean, or string.
When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.
If you want to only read out a small number of rows (avoiding reading the entire file), specify that with nrows.
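For example (big_file.csv is a hypothetical path used for illustration):
pd.read_csv('big_file.csv', nrows=5)                    # read only the first 5 rows
chunker = pd.read_csv('big_file.csv', chunksize=1000)   # iterator over 1000-row pieces
total = 0
for piece in chunker:
    total += len(piece)                                  # e.g. count rows without loading the whole file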
Using DataFrame’s to_csv method, we can write the data out to a comma-separated file. Series also has a to_csv method.
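For instance, with a throwaway DataFrame and hypothetical output paths:
tiny = pd.DataFrame({'a': [1, 2, 3], 'b': [4.5, np.nan, 6.0]})
tiny.to_csv('out.csv')                      # write to a comma-separated file
tiny.to_csv('out_pipe.txt', sep='|')        # any single-character delimiter works
tiny.to_csv('out_na.csv', na_rep='NULL')    # represent missing values with a sentinel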
Most forms of tabular data can be loaded from disk using functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table. For any file with a single-character delimiter, you can use Python’s built-in csv module. To use it, pass any open file or file-like object to csv.reader.
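A minimal csv.reader sketch (the file name is hypothetical):
import csv

with open('examples.csv') as f:
    reader = csv.reader(f)
    lines = list(reader)
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}   # columns keyed by header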
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV.
In [76]:
obj = """{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
{"name": "Katie", "age": 33, "pet": "Cisco"}]
}"""
In [77]:
obj
Out[77]:
JSON is very nearly valid Python code with the exception of its null value null and some other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data. I’ll use json here as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads.
json.dumps on the other hand converts a Python object back to JSON
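For example, with the obj string above:
import json

result = json.loads(obj)                    # JSON string -> Python dict
asjson = json.dumps(result)                 # Python object -> JSON string
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])   # list of dicts into a DataFrame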
Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. lxml (http://lxml.de) is one that has consistently strong performance in parsing very large files. lxml has multiple programmer interfaces; first I’ll show using lxml.html for HTML, then parse some XML using lxml.objectify.
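A minimal sketch with lxml.html, assuming lxml is installed (the HTML string is made up for illustration):
from lxml import html

doc = html.fromstring('<html><body><a href="http://example.com">Example</a></body></html>')
links = doc.findall('.//a')                 # all <a> elements in the document
urls = [lnk.get('href') for lnk in links]   # their href attributes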
One of the easiest ways to store data efficiently in binary format is using Python’s built-in pickle serialization. Conveniently, pandas objects all have a to_pickle method which writes the data to disk as a pickle.
You read the data back into Python with pandas.read_pickle, another pickle convenience function.
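For example, reusing the babynames file from earlier (the pickle path is hypothetical):
frame = pd.read_csv('data/names/yob2015.txt', names=['name', 'sex', 'births'])
frame.to_pickle('frame_pickle')             # write the object to disk as a pickle
pd.read_pickle('frame_pickle')              # read it back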
There are a number of tools that facilitate efficiently reading and writing large amounts of scientific data in binary format on disk. A popular industry-grade library for this is HDF5, which is a C library with interfaces in many other languages like Java, Python, and MATLAB. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file contains an internal file system-like node structure enabling you to store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compressors, enabling data with repeated patterns to be stored more efficiently. For very large datasets that don’t fit into memory, HDF5 is a good choice as you can efficiently read and write small sections of much larger arrays.
There are not one but two interfaces to the HDF5 library in Python, PyTables and h5py, each of which takes a different approach to the problem. h5py provides a direct, but high-level interface to the HDF5 API, while PyTables abstracts many of the details of HDF5 to provide multiple flexible data containers, table indexing, querying capability, and some support for out-of-core computations.
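A short HDFStore sketch, assuming PyTables is installed (the file name is hypothetical):
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame                       # store a DataFrame under a key, dict-style
store['obj1_col'] = frame['a']              # Series work too
store['obj1']                               # retrieve it back
store.close()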
pandas also supports reading tabular data stored in Excel 2003 (and higher) files using the ExcelFile class. Internally ExcelFile uses the xlrd and openpyxl packages, so you may have to install them first. To use ExcelFile, create an instance by passing a path to an xls or xlsx file.
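For example, with a hypothetical workbook data.xlsx:
xls_file = pd.ExcelFile('data.xlsx')        # requires xlrd/openpyxl
table = xls_file.parse('Sheet1')            # read one sheet into a DataFrame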
Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package (http://docs.python-requests.org).
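A brief sketch with requests; the GitHub issues endpoint is just one illustrative public JSON API:
import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)                    # HTTP GET request
data = resp.json()                          # decode the JSON response body into Python objects
issues = pd.DataFrame(data, columns=['number', 'title', 'state'])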
In many applications data rarely comes from text files, that being a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL (so-called NoSQL) databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I’ll use an in-memory SQLite database using Python’s built-in sqlite3 driver.
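A self-contained sketch with an in-memory SQLite database:
import sqlite3

con = sqlite3.connect(':memory:')           # in-memory SQLite database
con.execute('CREATE TABLE test (a VARCHAR(20), b REAL)')
con.executemany('INSERT INTO test VALUES (?, ?)', [('Atlanta', 1.25), ('Tallahassee', 2.6)])
con.commit()
pd.read_sql('SELECT * FROM test', con)      # query results straight into a DataFrame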
pymongo is the official driver for MongoDB. Documents stored in MongoDB are found in collections inside databases. Each running instance of the MongoDB server can have multiple databases, and each database can have multiple collections.
Much of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data is stored in files or databases is not the way you need it for a data processing application. Many people choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or UNIX text processing tools like sed or awk.
Data contained in pandas objects can be combined together in a number of built-in ways:
Merge or join operations combine data sets by linking rows using one or more keys. These operations are central to relational databases. The merge function in pandas is the main entry point for using these algorithms on your data.
In [79]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
In [80]:
df1.merge(df2)
Out[80]:
In [81]:
pd.merge(df1, df2, on="key")
Out[81]:
By default merge does an 'inner' join; the keys in the result are the intersection. Other possible options are 'left' , 'right' , and 'outer'.
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key.
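For instance, continuing with df1 and df2 above, plus a second pair merged on the right frame's index (a brief sketch):
pd.merge(df1, df2, how='outer')             # union of the keys from both frames
left = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
pd.merge(left, right, left_on='key', right_index=True)   # right's index is the merge key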
Another kind of data combination operation is alternatively referred to as concatenation, binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy arrays.
In [82]:
arr = np.arange(12).reshape((3, 4))
In [86]:
np.concatenate([arr, arr], axis=1)
Out[86]:
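pandas' concat does the analogous thing for Series and DataFrames, with options for the axis and for labeling the pieces (a brief sketch):
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
pd.concat([s1, s2])                         # stack values and indexes end to end
pd.concat([s1, s2], axis=1)                 # glue along the columns instead (outer join on the index)
pd.concat([s1, s2], keys=['one', 'two'])    # hierarchical index identifying each piece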
Another data combination situation can’t be expressed as either a merge or concatenation operation. You may have two datasets whose indexes overlap in full or part.
In [87]:
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
index=['f', 'e', 'd', 'c', 'b', 'a'])
b = Series(np.arange(len(a), dtype=np.float64),
index=['f', 'e', 'd', 'c', 'b', 'a'])
In [88]:
b[-1] = np.nan
In [89]:
np.where(pd.isnull(a), b, a)
Out[89]:
In [90]:
b[:-2].combine_first(a[2:])
Out[90]:
There are a number of fundamental operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions: stack, which rotates (pivots) the columns in the data into the rows, and unstack, which pivots the rows back into the columns.
In [96]:
data = DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'], name='number'))
In [97]:
data
Out[97]:
In [98]:
data.stack()
Out[98]:
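unstack is the inverse operation, pivoting an index level back into the columns (a brief sketch):
result = data.stack()
result.unstack()                            # back to the original wide layout
result.unstack(0)                           # unstack a different level by number (here 'state')
result.unstack('state')                     # levels can also be referred to by name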
Note that pivot is just a shortcut for creating a hierarchical index using set_index and reshaping with unstack.
In [99]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
In [100]:
data
Out[100]:
In [102]:
data.duplicated()
Out[102]:
In [103]:
data.drop_duplicates()
Out[103]:
In [3]:
import numpy as np
import pandas as pd
from pandas import *
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham',
'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [4]:
data
Out[4]:
In [5]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
In [6]:
data["animal"] = data["food"].map(str.lower).map(meat_to_animal)
In [7]:
data
Out[7]:
In [10]:
data.columns.map(str.upper)
Out[10]:
In [12]:
data.rename(columns=str.upper)
Out[12]:
Continuous data is often discretized or otherwise separated into “bins” for analysis.
In [13]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [14]:
bins = [18, 25, 35, 60, 100]
In [15]:
cats = pd.cut(ages, bins)
In [16]:
cats
Out[16]:
In [18]:
cats.codes
Out[18]:
In [19]:
cats.value_counts()
Out[19]:
In [20]:
np.random.seed(12345)
data = DataFrame(np.random.randn(1000, 4))
In [21]:
data.describe()
Out[21]:
In [22]:
data[(np.abs(data) > 3).any(1)]
Out[22]:
In [25]:
df = DataFrame(np.arange(5 * 4).reshape(5, 4))
In [28]:
sampler = np.random.permutation(5)
df.take(sampler)
Out[28]:
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns of 1s and 0s.
In [29]:
df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
In [30]:
df
Out[30]:
In [31]:
pd.get_dummies(df['key'])
Out[31]:
Python has long been a popular data munging language in part due to its ease-of-use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.
In [37]:
s = 'a,b, guido'
In [38]:
s.split(',')
Out[38]:
In [39]:
[x.strip() for x in s.split(',')]
Out[39]:
In [42]:
','.join(['a', 'b', 'c'])
Out[42]:
In [44]:
import re
text = "foo bar\t baz \tqux"
In [46]:
re.split(r'\s+', text)
Out[46]:
In [47]:
regex = re.compile(r'\s+')
In [49]:
regex.split(text)
Out[49]:
In [50]:
regex.findall(text)
Out[50]:
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
In [52]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = Series(data)
In [55]:
data.str.contains("gmail")
Out[55]:
In [57]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
In [59]:
data.str.findall(pattern, flags=re.IGNORECASE)
Out[59]:
Visualization may be a part of the exploratory process; for example, it can help identify outliers, suggest needed data transformations, or generate ideas for models. matplotlib is a (primarily 2D) desktop plotting package designed for creating publication-quality plots. matplotlib has a number of add-on toolkits, such as mplot3d for 3D plots and basemap for mapping and projections.
In [60]:
import matplotlib.pyplot as plt
%matplotlib inline
In [61]:
plt.plot(np.arange(10))
Out[61]:
Plots in matplotlib reside within a Figure object. You can create a new figure with plt.figure
In [79]:
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
plt.plot(np.random.randn(50).cumsum(), 'k--')
_ = ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))
Out[79]:
The spacing can be most easily changed using the subplots_adjust Figure method. wspace and hspace control the percentage of the figure width and figure height, respectively, to use as spacing between subplots.
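For example, a sketch that removes the inter-subplot spacing entirely:
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
    for j in range(2):
        axes[i, j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
plt.subplots_adjust(wspace=0, hspace=0)     # no gap between the subplots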
The pyplot interface, designed for interactive use, consists of methods like xlim, xticks, and xticklabels. These control the plot range, tick locations, and tick labels, respectively. All such methods act on the active or most recently created AxesSubplot. Each of them corresponds to two methods on the subplot object itself; in the case of xlim these are ax.get_xlim and ax.set_xlim.
In [89]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum())
ticks = ax.set_xticks([0, 250, 500, 750, 1000])
ax.set_title('My first matplotlib plot')
ax.set_xlabel('Stages')
Out[89]:
In [92]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum(), 'k', label='one')
ax.plot(np.random.randn(1000).cumsum(), 'k--', label='two')
ax.plot(np.random.randn(1000).cumsum(), 'k.', label='three')
ax.legend(loc='best')
Out[92]:
Annotations and text can be added using the text, arrow, and annotate functions. text draws text at given coordinates (x, y) on the plot with optional custom styling.
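A small sketch of text and annotate on a subplot (coordinates and labels are made up):
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(100).cumsum(), 'k')
ax.text(10, 0, 'Hello world!', family='monospace', fontsize=10)     # text at data coordinates
ax.annotate('a point of interest', xy=(50, 0), xytext=(60, 2),
            arrowprops=dict(arrowstyle='->'))                        # arrow pointing from the text to xy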
Drawing shapes requires some more care. matplotlib has objects that represent many common shapes, referred to as patches. Some of these, like Rectangle and Circle, are found in matplotlib.pyplot, but the full set is located in matplotlib.patches.
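For example, a sketch adding a rectangle and a circle to a subplot:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
ax.add_patch(rect)                          # patches must be added to a subplot explicitly
ax.add_patch(circ)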
The active figure can be saved to file using plt.savefig. This method is equivalent to the figure object’s savefig instance method.
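For example (the file names are hypothetical):
plt.savefig('figpath.png', dpi=400, bbox_inches='tight')   # high resolution, trimmed whitespace
plt.savefig('figpath.svg')                                  # output format inferred from the extension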
matplotlib comes configured with color schemes and defaults that are geared primarily toward preparing figures for publication. Fortunately, nearly all of the default behavior can be customized via an extensive set of global parameters governing figure size, subplot spacing, colors, font sizes, grid styles, and so on. There are two main ways to interact with the matplotlib configuration system. The first is programmatically from Python using the rc method.
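A brief sketch of the rc method:
plt.rc('figure', figsize=(10, 10))          # set the default figure size globally
font_options = {'family': 'monospace', 'weight': 'bold', 'size': 8}
plt.rc('font', **font_options)              # options can also be passed as a dict of keyword arguments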
For more extensive customization and to see a list of all the options, matplotlib comes with a configuration file matplotlibrc in the matplotlib/mpl-data directory. If you customize this file and place it in your home directory titled .matplotlibrc , it will be loaded each time you use matplotlib.
matplotlib is actually a fairly low-level tool. Therefore, pandas has an increasing number of high-level plotting methods for creating standard visualizations that take advantage of how data is organized in DataFrame objects.
Series and DataFrame each have a plot method for making many different plot types. By default, they make line plots.
In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
In [7]:
s.plot()
Out[7]:
In [9]:
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
columns=['A', 'B', 'C', 'D'],
index=np.arange(0, 100, 10))
In [11]:
df
Out[11]:
In [10]:
df.plot()
Out[10]:
Making bar plots instead of line plots is as simple as passing kind='bar' (for vertical bars) or kind='barh' (for horizontal bars). In this case, the Series or DataFrame index will be used as the X (bar) or Y (barh) ticks.
In [13]:
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop')).plot(kind="bar")
In [14]:
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop')).plot(kind="barh")
A histogram, with which you may be well-acquainted, is a kind of bar plot that gives a discretized display of value frequency. The data points are split into discrete, evenly spaced bins, and the number of data points in each bin is plotted.
A related plot type is a density plot, which is formed by computing an estimate of a continuous probability distribution that might have generated the observed data. A usual procedure is to approximate this distribution as a mixture of kernels, that is, simpler distributions like the normal (Gaussian) distribution.
Scatter plots are a useful way of examining the relationship between two one-dimensional data series. matplotlib has a scatter plotting method that is the workhorse of making these kinds of plots.
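A few quick sketches of these plot types with random data (the density plot assumes SciPy is installed):
values = pd.Series(np.random.randn(200))
values.hist(bins=50)                        # histogram of value frequencies
values.plot(kind='kde')                     # kernel density estimate
x = np.random.randn(100)
y = x + np.random.randn(100) * 0.5
plt.scatter(x, y)                           # scatter plot of two 1D arrays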
Ushahidi is a non-profit software company that enables crowdsourcing of information related to natural disasters and geopolitical events via text message.
GroupBy uses split-apply-combine:
In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object.
In [3]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),
'data2' : np.random.randn(5)})
In [4]:
df
Out[4]:
In [11]:
df['data1'].groupby(df['key1']).mean()
Out[11]:
In [12]:
df['data1'].groupby([df['key1'], df['key2']]).mean()
Out[12]:
In [16]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
In [17]:
dict(list(df.groupby('key1')))
Out[17]:
In [21]:
list(df.groupby('key1')['data1'])
Out[21]:
In [22]:
list(df['data1'].groupby(df['key1']))
Out[22]:
In [26]:
people = pd.DataFrame(np.random.randn(5, 5),
columns=['a', 'b', 'c', 'd', 'e'],
index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
In [27]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
'd': 'blue', 'e': 'red', 'f' : 'orange'}
In [29]:
people.groupby(mapping, axis=1).mean()
Out[29]:
In [31]:
map_series = pd.Series(mapping)
In [32]:
people.groupby(map_series, axis=1).count()
Out[32]:
In [33]:
people.groupby(len).sum()
Out[33]:
In [34]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
[1, 3, 5, 1, 3]], names=['cty', 'tenor'])
In [36]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
In [37]:
hier_df
Out[37]:
In [41]:
hier_df.groupby(level='cty', axis=1).count()
Out[41]:
An aggregation is any data transformation that produces scalar values from arrays, for example mean, count, min, and sum.
In [50]:
df = pd.DataFrame(np.random.randn(1, 12).reshape(3, 4))
In [51]:
df
Out[51]:
In [63]:
df.groupby(0).agg('mean')
Out[63]:
By default the group keys become the index of the result; you can disable this behavior in most cases by passing as_index=False to groupby.
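A small sketch reusing the structure of the earlier key1/key2 DataFrame:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1', as_index=False)[['data1', 'data2']].mean()   # group keys stay as a column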